Page Layout Classification Technique for Biomedical Documents

Authors

Daniel X. Le

George R. Thoma

Abstract

The structural layout information of scanned document pages is valuable for a wide range of document processing applications, such as automatic document searching, document delivery, and automated data entry. This paper describes the classification of scanned document pages into different classes of physical layout structures. The proposed page layout classification technique uses a combination of geometry-based and content-based zone features calculated from optical character recognition (OCR) output. Geometry-based features are derived from geometric zone information, and content-based features are derived from zone contents. A new feature called “single and multiple column zone vertical area string pattern” is also proposed to normalize document image pages. After the pages are normalized, a template matching algorithm calculates similarity classification features by matching the vertical area string patterns of document pages to those of predefined document layout structures. The similarity classification features and both the geometry-based and content-based zone features are then input into a rule-based learning system, which makes the final decision on the page layout class. The performance of the classification scheme has been evaluated on a sample of several hundred images of biomedical journal pages. Preliminary evaluation results show that the approach can classify journal pages into different classes of physical layout structures with an accuracy of more than 96%.
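The abstract does not define the vertical area string pattern or the matching algorithm, so the following is only a minimal sketch of the general idea of matching a page's string pattern against predefined layout templates. Everything in it is an assumption made for illustration: the zone representation `(top, bottom, n_columns)`, the symbols `S`, `M`, and `-`, the function names, the template strings, and the use of Python's `difflib.SequenceMatcher` as the similarity measure. It is not the authors' method.

```python
from difflib import SequenceMatcher

# Hypothetical zone format: (top, bottom, n_columns), with top and bottom
# given as fractions of the page height (0.0 = top edge, 1.0 = bottom edge).

def vertical_area_string(zones, length=20):
    """Encode a page as a fixed-length string, one symbol per horizontal band.

    'S' marks a band covered by a single-column zone, 'M' a band covered by
    a multiple-column zone, and '-' a band covered by no zone. Using a fixed
    length makes pages of different sizes comparable (an assumed stand-in for
    the normalization step mentioned in the abstract).
    """
    symbols = []
    for i in range(length):
        y = (i + 0.5) / length  # centre of the i-th band
        symbol = "-"
        for top, bottom, n_columns in zones:
            if top <= y < bottom:
                symbol = "S" if n_columns == 1 else "M"
                break
        symbols.append(symbol)
    return "".join(symbols)

def similarity(a, b):
    """Similarity in [0, 1] between two pattern strings (assumed measure)."""
    return SequenceMatcher(None, a, b).ratio()

def best_template(pattern, templates):
    """Return the name of the template whose pattern is most similar."""
    return max(templates, key=lambda name: similarity(pattern, templates[name]))

# Illustrative page: a single-column title block above a two-column body.
zones = [(0.0, 0.3, 1), (0.3, 1.0, 2)]
pattern = vertical_area_string(zones, length=10)

# Illustrative templates; real layout classes would come from the journals.
templates = {"title_page": "SSSMMMMMMM", "body_page": "MMMMMMMMMM"}
print(pattern, best_template(pattern, templates))
```

In a pipeline like the one the abstract outlines, similarity scores of this kind would be one group of inputs, alongside the geometry-based and content-based zone features, to the rule-based system that makes the final decision.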


Similar resources

Document page similarity based on layout visual saliency: Application to query by example and document classification

In this paper we propose to define a measure of visual similarity to compare different pages in a corpus. This measure is based on the analysis of the visual layout saliency of the page composition. This similarity is computed using both the document layout and characteristics of the text itself. The text characterization uses statistical features derived from textural primitives. Our purpose i...


An Algorithm for Finding Maximal Whitespace Rectangles at Arbitrary Orientations for Document Layout Analysis

The analysis of the background structure (whitespace) of page images has become an important technique for physical document layout analysis. Globally maximal whitespace rectangles have been previously demonstrated to constitute a concise representation of the major layout features of documents. However, previous methods for computing maximal whitespace rectangles were limited to axis-aligned re...


Page Classification for Meta-data Extraction from Digital Collections

Automatic extraction of meta-data from collections of scanned documents (books and journals) is a useful task in order to increase the accessibility of these digital collections. In order to improve the extraction of meta-data, the classification of the page layout into a set of pre-defined classes can be helpful. In this paper we describe a method for classifying document images on the basis o...


Document page similarity based on layout visual saliency: application to query by example and document classificat - Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on


STAN: Structural Analysis for Web Documents

In this paper we present STAN, a structural analysis tool used for classifying web documents while at the same time extracting meaningful information from them. The extraction and classification rules are defined in terms of a structural grammar operating on both layout properties and content properties of the document. STAN was designed to accept HTML as input and is able to process documents...



Publication date: 2000
